Biological Pattern Discovery with R: Machine Learning Approaches (Zheng Rong Yang)

(b) and (c), respectively. The data presented in Figure 2.29 (a) may have the worst clustering performance and the data presented in Figure 2.29 (c) may have the best clustering performance. Based on the comparison between the three panels, it can be seen that the larger the between-cluster distance, the better the discrimination power and the clustering performance.

Figure 2.29 (a), (b), (c): Three scenarios to show the impact of the between-cluster variance on the clustering performance. The dots stand for the data points and the triangles stand for the cluster centres. 'Sb' stands for the between-cluster sum of squares.
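The effect illustrated by the three panels can be reproduced with a small simulation. The following is only a sketch under stated assumptions: the seed, the number of points per cluster and the three centre offsets are hypothetical choices and are not the data behind Figure 2.29. It fits a two-cluster kmeans model to simulated data whose true centres are moved progressively further apart and records betweenss for each scenario.

```r
# Hypothetical simulation (not the data of Figure 2.29): two Gaussian
# clusters whose true centres are moved further apart in each scenario.
set.seed(1)                  # assumed seed, for reproducibility only
n <- 50                      # assumed number of points per cluster
separations <- c(1, 3, 6)    # assumed centre offsets for three scenarios

between_ss <- sapply(separations, function(d) {
  x <- rbind(
    matrix(rnorm(2 * n), ncol = 2),            # cluster around (0, 0)
    matrix(rnorm(2 * n, mean = d), ncol = 2)   # cluster around (d, d)
  )
  fit <- kmeans(x, centers = 2, nstart = 10)   # two-cluster K-means model
  fit$betweenss                                # between-cluster sum of squares
})

names(between_ss) <- paste0("offset_", separations)
print(between_ss)
```

Under these assumptions, betweenss should grow as the offset increases, which mirrors the progression from panel (a) to panel (c).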

The outputs of the kmeans function include several sums of squares. The output named withinss is a vector and is called the within-cluster sum of squares. Each entry of the vector is the sum of squares of one cluster, referred to here as the within-cluster variance: the sum of the squared distances between the centre of a cluster and all data points which have been classified into the cluster. The output named tot.withinss is the sum of withinss and stands for the total within-cluster variance. It is denoted by $S_W^2$. The output named betweenss stands for the sum of the squared distances between the cluster centres and the overall centre of the data, weighted by cluster size, and is named the between-cluster variance. It is denoted by $S_B^2$.
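The three outputs described above can be checked by recomputing them from the fitted cluster centres. The sketch below uses simulated data; the seed, sample sizes and cluster means are assumptions made purely for illustration, and the names prefixed with manual_ are hypothetical helpers rather than part of the kmeans output.

```r
# Simulated two-cluster data (assumed values, for illustration only)
set.seed(42)
x <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),   # 30 points around (0, 0)
  matrix(rnorm(60, mean = 4), ncol = 2)    # 30 points around (4, 4)
)
fit <- kmeans(x, centers = 2, nstart = 10)

# withinss: for each cluster, the sum of squared distances between the
# cluster centre and the points assigned to that cluster
manual_withinss <- sapply(seq_len(nrow(fit$centers)), function(k) {
  members <- x[fit$cluster == k, , drop = FALSE]
  sum(sweep(members, 2, fit$centers[k, ])^2)
})

# tot.withinss: the sum of the per-cluster values
manual_tot_withinss <- sum(manual_withinss)

# betweenss: squared distances between each cluster centre and the
# overall data centre, weighted by cluster size
grand_centre <- colMeans(x)
manual_betweenss <- sum(fit$size * rowSums(sweep(fit$centers, 2, grand_centre)^2))

# Compare the manual values with the kmeans outputs
print(all.equal(manual_withinss, fit$withinss, check.attributes = FALSE))
print(all.equal(manual_tot_withinss, fit$tot.withinss))
print(all.equal(manual_betweenss, fit$betweenss))
```

The total sum of squares reported as totss splits into these two parts, so fit$tot.withinss + fit$betweenss should match fit$totss up to rounding.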

In summary, two statistics (tot.withinss or $S_W^2$ and betweenss or $S_B^2$) can be used to assess the performance of a cluster model. Based on $S_W^2$ and $S_B^2$, the well-known F statistic used in ANOVA can be considered for optimising a K-means model structure. The F statistic is defined as

for which a p value can be calculated for the significance evaluation,